[LoRA] LoRA cuda graph specialization #25914
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
a0ed199 to 00d1328
/gemini review
Code Review
This pull request introduces a specialization for CUDA graphs with LoRA to optimize performance. The key idea is to capture two distinct sets of CUDA graphs: one for batches with active LoRA adapters and another for batches without. This avoids the overhead associated with LoRA operations when no LoRA adapters are in use. The changes include adding a has_lora attribute to BatchDescriptor for dispatching, modifying the CUDA graph capture logic to handle both scenarios, and moving the zeroing of the intermediate LoRA buffer to be conditional on LoRA activation. The implementation appears solid and the provided test results demonstrate a significant reduction in overhead, from 10.5% down to 1.4%, when LoRA is enabled but not active. The code changes are consistent and correctly implement the intended optimization. I have not found any critical or high-severity issues.
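As a rough illustration of the dispatching change described above (a minimal sketch only; the actual BatchDescriptor in vLLM may differ in base class, fields, and defaults), the per-batch dispatch key could look like:

    from typing import NamedTuple

    class BatchDescriptor(NamedTuple):
        # Illustrative dispatch key for looking up a captured CUDA graph.
        num_tokens: int
        uniform_decode: bool = False
        # New in this PR: whether the batch has any active LoRA adapters,
        # so LoRA and non-LoRA batches resolve to different captured graphs.
        has_lora: bool = False

    # Two batches of the same shape now map to different graphs:
    assert (BatchDescriptor(num_tokens=256, has_lora=True)
            != BatchDescriptor(num_tokens=256, has_lora=False))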
Thank you for submitting this pull request, @andylolu2! I've reviewed the changes and appreciate the clear problem statement and detailed implementation notes. This feature significantly improves efficiency by specializing CUDA graphs for LoRA and non-LoRA scenarios. The reported reduction in overhead from 10.5% to 1.4% is a substantial gain, directly addressing the stated purpose of minimizing overhead when LoRA is enabled but not actively used. Key observations from the review:
Overall, this is a valuable enhancement that improves the performance characteristics of LoRA integration. Great work!
Thanks @andylolu2. cc @jeejeelee
vllm/v1/cudagraph_dispatcher.py
Outdated
self.add_cudagraph_key(
    cudagraph_mode.mixed_mode(),
    BatchDescriptor(num_tokens=bs, uniform_decode=False))
for has_lora in [True, False]:
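For context on the loop above: registering a key for each has_lora value gives every capture size two dispatch keys instead of one, so roughly twice as many graphs get captured (presumably only when LoRA is enabled, which matches the memory numbers discussed below). A toy illustration, not vLLM code:

    # Toy key enumeration: the extra has_lora dimension doubles the key set,
    # hence roughly twice as many captured graphs when LoRA is enabled.
    capture_sizes = [1, 2, 4, 8, 16]
    keys = {(num_tokens, has_lora)
            for num_tokens in capture_sizes
            for has_lora in (True, False)}
    assert len(keys) == 2 * len(capture_sizes)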
QQ: Will this increase the memory consumption of the CUDA graphs?
Yes it does, but not by too much.
Before PR: (with --enable-lora)
Free memory ... 1.62 GiB for CUDAGraph memory.
After PR: (with --enable-lora)
Free memory ... 2.38 GiB for CUDAGraph memory.
@andylolu2 can you also try this with just the base model (i.e. not enabling LoRA) to see that it doesn't affect the CUDA graph memory? Thanks.
Baseline is: (without --enable-lora)
Free memory ... 1.14 GiB for CUDAGraph memory.
00d1328 to b027fd2
Cc @fhl2000 can you take a look?
fhl2000
left a comment
LGTM for the cudagraph dispatching stuff. Only some small thoughts on the LoRA path.
jeejeelee
left a comment
LGTM, @ProExpertProg please take another look
@ProExpertProg @jeejeelee I see CI failures but seemingly unrelated. I scanned through them and all failures seem to be caused by
Let me sync with main and test again |
Purpose
Currently, enabling LoRA (with cuda graphs, which are necessary for reasonable speed) adds overhead to the normal inference path, even if there are no active LoRA adapters. This is because we currently only capture cuda graphs with LoRA operations included.
In this PR, I make some small changes to capture a different set of cuda graphs when there are no active LoRA adapters, so we get exactly the same speed as normal inference when there are no active LoRA requests.
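To make the idea concrete, here is a minimal, self-contained sketch in plain PyTorch (not vLLM internals; the model, shapes, and dict-based dispatch are all illustrative) of capturing one graph per has_lora value and replaying whichever one matches the batch, so a batch with no active adapters never launches LoRA kernels:

    import torch

    def forward(x: torch.Tensor, has_lora: bool) -> torch.Tensor:
        out = x @ weight
        if has_lora:
            # LoRA path; absent from the graph captured with has_lora=False.
            out = out + (x @ lora_a.T) @ lora_b.T
        return out

    device = "cuda"
    weight = torch.randn(64, 64, device=device)
    lora_a = torch.randn(8, 64, device=device)
    lora_b = torch.randn(64, 8, device=device)
    static_x = torch.zeros(4, 64, device=device)

    graphs, outputs = {}, {}
    for has_lora in (False, True):
        # Warm up on a side stream before capture (standard CUDA graph recipe).
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            forward(static_x, has_lora)
        torch.cuda.current_stream().wait_stream(s)

        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            outputs[has_lora] = forward(static_x, has_lora)
        graphs[has_lora] = g

    # At runtime, dispatch on whether the batch has any active LoRA requests.
    static_x.copy_(torch.randn(4, 64, device=device))
    graphs[False].replay()  # identical kernel launches to a non-LoRA model
    result = outputs[False]

In the PR itself the dispatch key is the BatchDescriptor from the diff above rather than a plain dict, but the effect is the same: the no-LoRA graph contains exactly the base-model kernels.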
Implementation
- Add a has_lora attribute to BatchDescriptor, set to len(self.input_batch.lora_id_to_lora_request) > 0.
- Move the .zero() of the intermediate LoRA buffer to inside lora_shrink, so it is skipped when there are no active LoRAs (see the sketch below).
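A rough sketch of the second bullet (simplified and hypothetical: the real lora_shrink in vLLM wraps a custom kernel, and this signature is made up for illustration):

    import torch

    def lora_shrink(x: torch.Tensor, lora_a: torch.Tensor,
                    buffer: torch.Tensor) -> None:
        # Simplified stand-in for the shrink op: buffer <- x @ lora_a^T.
        # Zeroing now happens inside the op, so when no LoRA is active and
        # the op is never launched, the memset is skipped along with it.
        buffer.zero_()
        buffer.add_(x @ lora_a.T)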
Test Plan
Show that LoRA still works functionally, but has zero overhead when there are no active LoRAs.
Test Result
--enable-lora overhead reduced from 10.5% to 1.4%. I compared the kernels launched and they are identical to the --no-enable-lora case when there are no active LoRAs, so I suspect the 1.4% overhead is just from additional CPU-side logic.

Baseline
Before PR
After PR
Functionality test
still passes (it uses cuda graphs).